find_car_make <- function(car_name){
  make <- str_extract(string = car_name,
                      pattern = "[:alpha:]*")
  return(make)
}

Writing Data Frame / Plotting Functions
Wednesday, May 15
Today we will…
- Discuss Group Project + Group Contract
- New Material
- Calling Functions on Datasets
- Thinking About Missing Data
- Lab 7: Functions and Fish
Group Project Details
Check out the Canvas page outlining the group project!
- Groups have been assigned.
- Your group contract is due on Monday!
Calling Functions on Datasets
Last Time…
We wrote a function called find_car_make() that takes in the name of a car and returns the “make” of the car (the company that created it).
- find_car_make("Toyota Camry") returns “Toyota”.
- find_car_make("Ford Anglica") returns “Ford”.
Pair Our Function with dplyr
Consider the mtcars data.
data(mtcars)
head(mtcars, n = 3)
                mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710     22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
. . .
Let’s use our new function:
mtcars |>
  rownames_to_column("make_model") |>
  mutate(make = find_car_make(make_model),
         .after = make_model) |>
  head(n = 3)
     make_model   make  mpg cyl disp  hp drat    wt  qsec vs am gear carb
1     Mazda RX4  Mazda 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 Mazda RX4 Wag  Mazda 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3    Datsun 710 Datsun 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
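This works inside mutate() because str_extract() is vectorized: given a character vector of car names, find_car_make() returns a vector of makes of the same length. A minimal sketch, assuming the stringr package is installed (the function is repeated only so the block runs on its own):

```r
library(stringr)

# Same function as defined earlier, repeated so this block is self-contained.
find_car_make <- function(car_name) {
  make <- str_extract(string = car_name,
                      pattern = "[:alpha:]*")
  return(make)
}

# One call handles a whole vector of names, which is what mutate() relies on.
car_names <- c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710")
makes <- find_car_make(car_names)
makes
```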
Recall the penguins Data
library(palmerpenguins)
data(penguins)
penguins |>
  head()
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
# ℹ 2 more variables: sex <fct>, year <int>
Function to Standardize Data
We want to take in a vector of numbers and standardize it, rescaling the values so that they all fall between 0 and 1.
. . .
std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  num <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)
  return(num / denom)
}

Standardizing Variables
Is it a good idea to standardize (scale) variables in a data analysis?
Why standardize?
- Easier to compare across variables.
- Easier to model – standardizes the amount of variability.
Why not standardize?
- More difficult to interpret the values.
. . .
E.g., a penguin with a bill length of 35 mm (std to 0.11) and a mass of 5500 g (std to 0.78).
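The standardized values in this example follow directly from the formula in std_to_01(). A minimal sketch of the arithmetic, assuming the observed ranges are 32.1 to 59.6 mm for bill length and 2700 to 6300 g for body mass (these ranges are taken from the palmerpenguins data and are not stated on the slide):

```r
# Same function as defined earlier, repeated so this block is self-contained.
std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  num <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)
  return(num / denom)
}

# Assumed minimum and maximum, with the example penguin in the middle.
bill_length <- c(32.1, 35, 59.6)
body_mass <- c(2700, 5500, 6300)

bill_std <- std_to_01(bill_length)
mass_std <- std_to_01(body_mass)

round(bill_std[2], 2)  # (35 - 32.1) / (59.6 - 32.1)
round(mass_std[2], 2)  # (5500 - 2700) / (6300 - 2700)
```

A bill that sits near the bottom of its range and a mass near the top of its range are easy to compare on the 0 to 1 scale, but the original units are no longer visible.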
Pair Our Function with dplyr
Let’s standardize penguin measurements.
penguins |>
  mutate(bill_length_mm = std_to_01(bill_length_mm),
         bill_depth_mm = std_to_01(bill_depth_mm),
         flipper_length_mm = std_to_01(flipper_length_mm),
         body_mass_g = std_to_01(body_mass_g))
- Ugh. Still copy-pasting!
. . .
Recall across()!
penguins |>
  mutate(across(.cols = bill_length_mm:body_mass_g,
                .fns = ~ std_to_01(.x))) |>
  slice_head(n = 4)
# A tibble: 4 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 Adelie Torgersen 0.255 0.667 0.153 0.292
2 Adelie Torgersen 0.269 0.512 0.237 0.306
3 Adelie Torgersen 0.298 0.583 0.390 0.153
4 Adelie Torgersen NA NA NA NA
# ℹ 2 more variables: sex <fct>, year <int>
Use variables as function arguments?
std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))
  data <- data |>
    mutate(variable = std_to_01(variable))
  return(data)
}
I used the existing function std_to_01() inside the new function for clarity!
. . .
But it didn’t work…
std_column_01(penguins, body_mass_g)
Error in `mutate()`:
ℹ In argument: `variable = std_to_01(variable)`.
Caused by error:
! object 'body_mass_g' not found
Tidy Evaluation
Functions using unquoted variable names as arguments are said to use nonstandard evaluation or tidy evaluation.
Tidy:
penguins |>
  pull(body_mass_g)
OR
penguins$body_mass_g
Untidy:
penguins[, "body_mass_g"]
OR
penguins[["body_mass_g"]]
. . .
Tidy evaluation isn’t naturally supported when writing your own functions.
Defused R Code
When a piece of code is defused, R doesn’t return its value like normal.
- Instead it returns an expression that describes how to evaluate it.
. . .
Evaluated code:
1 + 1
[1] 2
Defused code:
expr(1 + 1)
1 + 1
. . .
We produce defused code when we use tidy evaluation and our own functions don’t know how to handle it.
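The difference is easiest to see by storing a defused expression and evaluating it later. A minimal sketch, assuming the rlang package is installed:

```r
library(rlang)

# Evaluated immediately: the result is a number.
evaluated <- 1 + 1

# Defused: the result is the unevaluated call `1 + 1`, not a number.
defused <- expr(1 + 1)

is.numeric(evaluated)
is_call(defused)

# Evaluating the defused expression later recovers the value.
eval(defused)
```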
Solution 1
Don’t use tidy evaluation in your own functions.
- This is more complicated to read and use, but it’s safe.
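One way this could look is to pass the column name as a string and use base R indexing, so that no unquoted names cross the function boundary. This is a sketch only: std_column_01_chr() is a hypothetical name, and toy stands in for a real dataset.

```r
# Same function as defined earlier, repeated so this block is self-contained.
std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  num <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)
  return(num / denom)
}

# Hypothetical string-based version: `variable` is a column name in quotes.
std_column_01_chr <- function(data, variable) {
  stopifnot(is.data.frame(data),
            is.character(variable),
            length(variable) == 1,
            variable %in% names(data))
  data[[variable]] <- std_to_01(data[[variable]])
  return(data)
}

toy <- data.frame(id = 1:4, mass = c(10, 20, NA, 30))
toy_std <- std_column_01_chr(toy, "mass")
toy_std
```

The caller now has to write "mass" in quotes, and the function no longer reads like the rest of a dplyr pipeline.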
Solution 2: rlang
Use the rlang package!
- This package provides operators that simplify writing functions around tidyverse pipelines.
knitr::include_graphics("https://github.com/rstudio/hex-stickers/blob/main/thumbs/rlang.png?raw=true")
- Read more about using this package for function writing here!
Solution 2: rlang
Two ways to get around the issue of defused code:
- Embrace Operator ({{ }})
  - With {{ }}, you can transport a variable from one function to another.
. . .
- Defuse and Inject
  - You can first use enquo(arg) to defuse the variable.
  - Then use !!arg to inject the variable.
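Both approaches can be sketched on a small summary function. This assumes dplyr and rlang are installed; col_mean_embrace(), col_mean_inject(), and the toy data are hypothetical names used for illustration.

```r
library(dplyr)
library(rlang)

# Embrace: {{ }} passes the caller's column straight through to summarise().
col_mean_embrace <- function(data, variable) {
  data |>
    summarise(mean = mean({{ variable }}, na.rm = TRUE))
}

# Defuse and inject: enquo() captures the column, !! inserts it again.
col_mean_inject <- function(data, variable) {
  variable <- enquo(variable)
  data |>
    summarise(mean = mean(!!variable, na.rm = TRUE))
}

toy <- tibble(mass = c(10, 20, NA, 30))
col_mean_embrace(toy, mass)
col_mean_inject(toy, mass)
```

In this sketch the new column is given the literal name mean, so the ordinary = is enough on the left-hand side.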
Solution 2: rlang
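If we use either of these solutions to name a column, we also need to use the walrus operator (:=).
- This means we have to use := instead of = in any dplyr verb where one of these rlang fixes appears on the left-hand side, as the name of the column being created or modified.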
Recall Our Broken Function
std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))
  data <- data |>
    mutate(variable = std_to_01(variable))
  return(data)
}
std_column_01(penguins, body_mass_g)
Error in `mutate()`:
ℹ In argument: `variable = std_to_01(variable)`.
Caused by error:
! object 'body_mass_g' not found
- The code is defused, so mutate() doesn’t know what body_mass_g is.
- We need to modify variable to make this work correctly!
Fixing Our Broken Function
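A minimal sketch of how both fixes might look, assuming dplyr and rlang are installed. The suffixed function names and the toy data frame are hypothetical, used only so that both versions can sit side by side and the block runs on its own; calling either version on penguins with body_mass_g would be expected to rescale that one column, as in the output that follows.

```r
library(dplyr)
library(rlang)

# Same function as defined earlier, repeated so this block is self-contained.
std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  num <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)
  return(num / denom)
}

# Fix 1: embrace the argument, and use := because it names the column.
std_column_01_embrace <- function(data, variable) {
  stopifnot(is.data.frame(data))
  data <- data |>
    mutate({{ variable }} := std_to_01({{ variable }}))
  return(data)
}

# Fix 2: defuse with enquo(), then inject with !! (again with :=).
std_column_01_inject <- function(data, variable) {
  stopifnot(is.data.frame(data))
  variable <- enquo(variable)
  data <- data |>
    mutate(!!variable := std_to_01(!!variable))
  return(data)
}

toy <- tibble(id = 1:4, mass = c(10, 20, NA, 30))
fixed_embrace <- std_column_01_embrace(toy, mass)
fixed_inject <- std_column_01_inject(toy, mass)
fixed_embrace
```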
# A tibble: 6 × 7
species island bill_length_mm bill_depth_mm body_mass_g sex year
<fct> <fct> <dbl> <dbl> <dbl> <fct> <int>
1 Adelie Torgersen 39.1 18.7 0.292 male 2007
2 Adelie Torgersen 39.5 17.4 0.306 female 2007
3 Adelie Torgersen 40.3 18 0.153 female 2007
4 Adelie Torgersen NA NA NA <NA> 2007
5 Adelie Torgersen 36.7 19.3 0.208 female 2007
6 Adelie Torgersen 39.3 20.6 0.264 male 2007
Inject Multiple Variables
What if I want to modify multiple columns?
- Use across()!
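A sketch of how this might look: embracing the argument inside across() lets the caller pass any tidy selection, such as c(bill, mass) or bill:mass. std_columns_01() and the toy data are hypothetical names, and dplyr is assumed to be installed.

```r
library(dplyr)

# Same function as defined earlier, repeated so this block is self-contained.
std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  num <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)
  return(num / denom)
}

# Hypothetical multi-column version: `variables` is a tidy selection.
std_columns_01 <- function(data, variables) {
  stopifnot(is.data.frame(data))
  data <- data |>
    mutate(across(.cols = {{ variables }},
                  .fns = std_to_01))
  return(data)
}

toy <- tibble(id = 1:3, bill = c(1, 2, 3), mass = c(10, 30, 20))
toy_std <- std_columns_01(toy, c(bill, mass))
toy_std
```

Because across() handles the column names itself, this version does not need := anywhere.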
# A tibble: 5 × 7
species island bill_length_mm bill_depth_mm body_mass_g sex year
<fct> <fct> <dbl> <dbl> <dbl> <fct> <int>
1 Adelie Torgersen 0.255 0.667 0.292 male 2007
2 Adelie Torgersen 0.269 0.512 0.306 female 2007
3 Adelie Torgersen 0.298 0.583 0.153 female 2007
4 Adelie Torgersen NA NA NA <NA> 2007
5 Adelie Torgersen 0.167 0.738 0.208 female 2007
Missing Data
Types of Missing Data
- Missing Completely at Random (MCAR)
  - No difference between missing and observed values.
  - Missing observations are a random subset of all observations.
- Missing at Random (MAR)
  - Systematic difference between missing and observed values, but the difference can be entirely explained by other observed variables.
- Missing Not at Random (MNAR)
  - Missingness is directly related to the unobserved value.
Types of Missing Data
Consider a study of depression.
- Missing Completely at Random (MCAR)
  - Some subjects have missing lab values because a batch of samples was processed improperly.
- Missing at Random (MAR)
  - Subjects who identify as men are less likely to complete a survey on depression severity.
- Missing Not at Random (MNAR)
  - Subjects with more severe depression are less likely to complete a survey on depression severity.
When we remove missing data…
We implicitly assume observations are missing completely at random!
- We might be mostly removing data from subjects who identify as men.
- We might be mostly removing data from subjects with severe depression.
- We are inadvertently making our data less representative.
. . .
We need to take more care when dealing with missing values!
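A small simulation can make the risk concrete. This sketch uses made-up severity scores (not real study data) and an assumed logistic missingness rule, then compares the mean of the remaining values after dropping missing responses under MCAR and under MNAR.

```r
set.seed(331)

# Simulated "true" depression severity scores for 10,000 subjects.
n <- 10000
severity <- rnorm(n, mean = 50, sd = 10)

# MCAR: every response has the same 50% chance of being missing.
missing_mcar <- runif(n) < 0.5

# MNAR (assumed rule): higher severity means a higher chance of being missing.
missing_mnar <- runif(n) < plogis((severity - 50) / 5)

# Means after removing the missing responses.
true_mean <- mean(severity)
mcar_mean <- mean(severity[!missing_mcar])
mnar_mean <- mean(severity[!missing_mnar])

c(true = true_mean, mcar = mcar_mean, mnar = mnar_mean)
```

Dropping the missing rows leaves the MCAR mean close to the full-data mean, while the MNAR mean is pulled downward because the most severe cases are the ones most likely to be missing.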
Dealing with Missing Data
- Look for patterns!
  - Do observations with missing values have similar traits?
. . .
- Consider outside explanations!
  - Why might missing data exist?
  - Should we have a “missing” category in our analysis?
. . .
- Can we impute values?
  - If depression severity is MCAR within groups defined by gender, age, and education level, then the missing values should follow a distribution similar to that of the observed values from people of the same gender, age, and education level, so those observed values can inform the imputation.
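Two of these steps can be sketched with dplyr: summarising missingness by group to look for patterns, and a simple group-mean imputation. The survey data below are made up, and group-mean imputation is only one basic option, chosen here for brevity rather than as a recommended method.

```r
library(dplyr)

# Made-up survey responses; NA marks a subject who did not respond.
survey <- tibble(
  gender = c("man", "man", "man", "woman", "woman", "woman"),
  severity = c(4, NA, 6, 2, 3, NA)
)

# Look for patterns: what share of each group is missing?
missing_by_group <- survey |>
  group_by(gender) |>
  summarise(prop_missing = mean(is.na(severity)))

# Impute within groups: replace each NA with its group's observed mean.
survey_imputed <- survey |>
  group_by(gender) |>
  mutate(severity = if_else(is.na(severity),
                            mean(severity, na.rm = TRUE),
                            severity)) |>
  ungroup()

missing_by_group
survey_imputed
```

Filling in a single group mean understates the variability in the data, so this is best read as an illustration of the idea of borrowing information from similar subjects.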